1. INTRODUCTION

Cancer is one of the leading causes of death worldwide [1], [2], and it can attack any part of the
body [3]. Pancreatic cancer is one of the main causes of cancer-related deaths worldwide. In its early
stages the disease shows no symptoms; most symptoms appear only when the disease has reached its
final stage [4]. Pancreatic cancer is cancer that starts in the pancreas, and its most common type is
pancreatic adenocarcinoma [5]. The pancreas is located behind the stomach; in adults it is about
6 inches long and less than 2 inches wide [6]. There are various treatments for pancreatic cancer,
such as surgery, chemotherapy, radiation therapy, or a combination of these. The method of treatment
is chosen based on the extent of the cancer [7]. Information technology plays an important role in
medicine. Cancer is a disease that can be detected with machine learning, and data are very useful in
the medical field, as the rapid growth of data mining in medical science shows. The benefits of this
growth include high prediction accuracy, reduced treatment costs, better chances of patient recovery,
and support for life-saving decisions [8], [9]. Classification is a way to identify which group of
categories an observation belongs to [10]. One general form of classification works on continuous
values of the predictive attributes, whereas ensemble classification is useful for increasing
classification accuracy [11].

2. RESEARCH METHOD

The pancreatic cancer dataset was obtained from Al-Islam Hospital, Bandung, Indonesia. It consists
of 79 non-cancer and 124 cancer samples, each described by 6 numerical attributes, as shown in
Table 1. This research uses logistic regression and random forest for classification. Both methods
are evaluated using 3-fold cross-validation with a random state of 45, and their results are then
compared. A sample of the dataset is shown in Table 2.
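
As an illustration of the evaluation protocol described above, the following minimal sketch splits the 203 samples (79 non-cancer and 124 cancer) into 3 folds after a shuffle seeded with 45. The helper name `k_fold_indices` is hypothetical, and the use of Python's `random` module is an assumption: the source does not state which library performed the split, so the actual fold membership in the experiments may differ.

```python
import random


def k_fold_indices(n_samples, n_folds=3, seed=45):
    """Return a list of (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper: shuffles the sample indices once with a fixed seed,
    then cuts them into n_folds nearly equal folds.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # The first (n_samples % n_folds) folds receive one extra sample.
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size

    # Each fold serves once as the test set; the other folds form the training set.
    splits = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


splits = k_fold_indices(79 + 124)
for fold_no, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold_no}: {len(train_idx)} training, {len(test_idx)} test samples")
```

Every sample appears in exactly one test fold, so each model is scored once on every observation.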


Table 1. Pancreatic cancer dataset variables


Attribute    Description
Age          Age of the examined patient
CA 19-9      Number of cancer antigen (CA 19-9) units per milliliter of blood
Hemoglobin   Grams of hemoglobin per deciliter of blood
Leukocyte    Number of leukocyte cells per uL of blood
Hematocrit   Volume percentage of red blood cells
Thrombosis   Number of thrombocytes (platelets) per uL of blood


Table 2. Sample of the dataset from Al-Islam Hospital, Bandung, Indonesia


Age   CA 19-9   Hemoglobin   Leukocyte     Hematocrit   Thrombosis
      < 37      13-18        4000-10000    40-54        150000-450000
      (U/mL)    (g/dL)       (cells/uL)    (%)          (cells/uL)
38    34.61     12.1         7600          36.9         244000
82    35.02     12.1         4900          36.7         253000
35    35.4      6.3          10100         23.4         496000
58    35.83     9.8          33500         29.1         467000
52    36        9.8          7600          29.9         613000
41    36.03     12.6         3400          38           203000
40    36.94     11.9         8900          39.8         430000
51    37.41     6.6          9500          23.5         259000
64    39.25     11.5         15500         35.3         230000


2.1. Logistic regression

Logistic regression is the natural counterpart of ordinary linear regression when the target
variable is categorical. Let Y be the dependent (target) variable with two classes and X the
independent (predictor) variable, and let g(x) = Pr(Y = 1 | X = x) = 1 - Pr(Y = 0 | X = x). The
logistic regression model is linear in the logit of this probability [12]-[14]:


\[ \operatorname{Logit}[g(x)] = \log\left(\frac{g(x)}{1 - g(x)}\right) = \alpha + \beta x, \quad \text{where the odds} = \frac{g(x)}{1 - g(x)} \qquad (1) \]


The logit is thus the logarithm of the odds, modeled as a linear function of x. The parameter
\beta determines the rate of increase or decrease of the S-shaped curve g(x) [15].
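
Solving the logit model for the probability gives g(x) = 1 / (1 + exp(-(alpha + beta x))), with alpha the intercept and beta the slope. The short sketch below checks numerically that taking the logit of g(x) recovers the linear predictor. The function names and the coefficient values are hypothetical and chosen only for illustration; they are not fitted to the hospital dataset.

```python
import math


def logistic(x, alpha, beta):
    """g(x) = 1 / (1 + exp(-(alpha + beta * x))), the inverse of the logit."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))


def logit(p):
    """Log-odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, chosen only for illustration.
alpha, beta = -1.0, 0.5

for x in (0.0, 2.0, 4.0):
    g = logistic(x, alpha, beta)
    # logit(g) should equal the linear predictor alpha + beta * x.
    print(f"x={x:.1f}  g(x)={g:.4f}  logit={logit(g):.4f}  linear={alpha + beta * x:.4f}")
```

With a positive beta the curve rises with x, and a larger absolute value of beta makes the S-shape steeper.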


2.2. Random forest 

Random forest is a method developed by Breiman in 2001 [16], [17]. It aims to reach maximum
accuracy while avoiding the overfitting that can affect a decision tree [18]. Breiman enhanced the
estimation process previously carried out by decision trees and CART, starting by randomly selecting
m variables from the independent variables. Each decision tree (CART) is grown without pruning, and
the trees with the highest accuracy are selected. The random forest procedure depends on the number
of classifications [19]. Random forest has several advantages [20]: it overcomes the problem of
overfitting, it is less sensitive to outliers, its parameters can be adjusted easily, it eliminates the
need for tree pruning, and variable importance and accuracy are generated automatically. The features
selected by random forest agree with existing domain knowledge (e.g., physiological knowledge;
Guan et al., 2012) [21]. The flowchart of random forest is shown in Figure 1.
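
The sampling and aggregation steps described above can be sketched as follows. This is a simplified illustration, not the implementation used in the experiments: tree growing is omitted, all function names are hypothetical, and the forest size, subset size m, and seed are assumed values. In Breiman's algorithm the m candidate variables are redrawn at every split; drawing them once per tree here is a simplification.

```python
import random
from collections import Counter


def bootstrap_sample(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, m, rng):
    """Pick m distinct feature indices at random."""
    return sorted(rng.sample(range(n_features), m))


def majority_vote(tree_predictions):
    """Combine the class labels predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(45)          # seed value is an assumption
n_samples, n_features = 203, 6   # dataset size and attribute count from Section 2
k_trees, m = 5, 2                # hypothetical forest size and subset size

for tree_no in range(k_trees):
    rows = bootstrap_sample(n_samples, rng)
    cols = random_feature_subset(n_features, m, rng)
    # An unpruned tree would be grown here on the selected rows and features.
    print(f"tree {tree_no}: {len(set(rows))} distinct rows, features {cols}")

# Votes from five hypothetical trees for one test sample (1 = cancer, 0 = non-cancer).
print(majority_vote([1, 0, 1, 1, 0]))
```

Because each bootstrap sample is drawn with replacement, every tree sees a different subset of the rows, which is what decorrelates the trees before their predictions are combined.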


2.3. Confusion matrix 

The confusion matrix is one of the methods used to calculate accuracy in data mining and decision
support systems [22]. It yields precision and sensitivity, which distinguish how correctly labels are
classified in the different classes [23], [24]. Accuracy is the ratio of correct predictions to all
data. Precision is the ratio of true positive predictions to all positive predictions. Sensitivity
(recall) is the ratio of true positive predictions to all actually positive data. Finally, the
F1-score is used to determine the balance between sensitivity and precision. The four entries of the
confusion matrix are:
- True Positive (TP): Number of samples with pancreatic cancer that were diagnosed correctly
- False Positive (FP): Number of healthy people that were incorrectly identified as having pancreatic cancer
- True Negative (TN): Number of healthy people that were correctly identified
- False Negative (FN): Number of samples with pancreatic cancer that were incorrectly classified as
healthy
From Table 3, the formulas for accuracy, precision, recall (sensitivity), and F1-score can be built,
as given in (2)-(5).


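\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \qquad (2) \]

\[ \text{Precision} = \frac{TP}{TP + FP} \times 100\% \qquad (3) \]

\[ \text{Recall} = \frac{TP}{TP + FN} \times 100\% \qquad (4) \]

\[ \text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (5) \]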


[Figure 1: flowchart with the following elements: training data (n observations, m predictors);
k bootstrap samples; k trees; test data (n-N samples, m predictors); average of the single-tree
predictions.]


Figure 1. Flowchart of random forest [25]


Table 3. Confusion matrix


                 Recognized value
Actual value     Positive    Negative
Positive         TP          FN
Negative         FP          TN
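
Equations (2)-(5) can be computed directly from the four entries of Table 3, as in the minimal sketch below. The function name `classification_metrics` and the counts in the usage line are hypothetical and not taken from the experiments. The sketch assumes non-zero denominators, so it would need guarding for a classifier that never predicts the positive class.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F1-score, all in percent."""
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    # Precision and recall are already percentages, so F1 is one as well.
    f1 = 2.0 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical counts, not taken from the paper's experiments.
acc, prec, rec, f1 = classification_metrics(tp=40, fp=2, tn=25, fn=1)
print(f"accuracy={acc:.2f}%  precision={prec:.2f}%  recall={rec:.2f}%  F1={f1:.2f}%")
```

Because the F1-score is the harmonic mean of precision and recall, it always lies between the two and is pulled toward the smaller one.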


3. RESULTS AND DISCUSSION 

This research uses Jupyter Notebook to run the logistic regression and random forest programs for
the pancreatic cancer classification problem. Accuracy, precision, recall, and F1-score are tested by
varying the amount of training data. In this test, the proportion of training data is set to 10%,
20%, 30%, 40%, 50%, 60%, 70%, 80%, and 90% of the dataset. The accuracy, precision, recall, and
F1-score obtained with the logistic regression and random forest classifiers are shown in Table 4
and Table 5.
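
The nine training-data settings can be illustrated with the sketch below, which shuffles the 203 sample indices and keeps the stated percentage for training. The helper name, the seed of 45, and the rounding down of the training-set size are assumptions; the source does not describe how the split was implemented or how it was combined with cross-validation.

```python
import random


def split_by_training_percent(n_samples, train_percent, seed=45):
    """Shuffle the indices and keep train_percent % of them for training."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Integer division rounds the training-set size down (an assumption).
    n_train = n_samples * train_percent // 100
    return indices[:n_train], indices[n_train:]


n_samples = 79 + 124  # non-cancer + cancer samples
for train_percent in range(10, 100, 10):
    train_idx, test_idx = split_by_training_percent(n_samples, train_percent)
    print(f"{train_percent}% training: {len(train_idx)} train, {len(test_idx)} test")
```

At the 90% setting only about 21 samples remain for testing, so a single misclassification shifts the reported metrics by several percentage points.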

Table 4 shows that the amount of training data affects the accuracy, precision, recall, and
F1-score of logistic regression. The highest accuracy, 96.48%, was recorded with 30% training data,
while the lowest, 91.49%, was recorded with 70% training data. Precision reached its maximum value
of 100% with 70% and 90% training data. The highest recall was 97.70%, with 30% training data.
Finally, the F1-score reached its highest value, 96.29%, with 30% training data.

Table 5 shows that the amount of training data also affects the accuracy, precision, recall, and
F1-score of random forest. The highest accuracy, 99.38%, was recorded with 20% training data, while
the lowest, 89.68%, was recorded with 90% training data. Precision reached its maximum value of 100%
with 10%, 20%, and 30% training data. The highest recall was 99.10%, with 10% training data.
Finally, the F1-score reached its maximum value of 100% with 20% training data.


Table 4. Results of pancreatic cancer classification using logistic regression
No.   Training data (%)   Accuracy (%)   Precision (%)   Recall (%)   F1-score (%)


1.    10                  95.62          96.56           96.4         95.4
2.    20                  95.68          96.02           96.97        95.45
3.    30                  96.48          96.74           97.7         96.29
4.    40                  95.94          97.44           96           95.73
5.    50                  96.05          98.33           95.16        95.88
6.    60                  95.01          96.06           95.83        94.76
7.    70                  91.49          100             86.11        91.39
8.    80                  95.05          96.3            95.83        94.77
9.    90                  94.44          100             91.67        94.29


Table 5. Results of pancreatic cancer classification using random forest
No.   Training data (%)   Accuracy (%)   Precision (%)   Recall (%)   F1-score (%)


1.    10                  98.91          100             99.1         98.85
2.    20                  99.38          100             98.99        100
3.    30                  99.29          100             98.85        99.26
4.    40                  95.12          96.43           98.67        95.54
5.    50                  97             96.9            98.41        96.85
6.    60                  97.48          98.04           91.91        98.67
7.    70                  91.73          97.44           94.44        96.57
8.    80                  97.62          92.59           91.67        97.17
9.    90                  89.68          93.33           91.67        70.83
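
The best and worst accuracy of each method can be read off programmatically, as in the short sketch below. The accuracy values are copied from the Accuracy columns of Table 4 and Table 5; the variable and function names are hypothetical.

```python
# Accuracy (%) by training-data percentage, copied from Table 4 and Table 5.
training_percent = [10, 20, 30, 40, 50, 60, 70, 80, 90]
accuracy = {
    "logistic regression": [95.62, 95.68, 96.48, 95.94, 96.05, 95.01, 91.49, 95.05, 94.44],
    "random forest": [98.91, 99.38, 99.29, 95.12, 97.00, 97.48, 91.73, 97.62, 89.68],
}


def best_and_worst(values):
    """Return ((best_percent, best_value), (worst_percent, worst_value))."""
    pairs = list(zip(training_percent, values))
    best = max(pairs, key=lambda pair: pair[1])
    worst = min(pairs, key=lambda pair: pair[1])
    return best, worst


for method, values in accuracy.items():
    (best_pct, best_acc), (worst_pct, worst_acc) = best_and_worst(values)
    print(f"{method}: best {best_acc:.2f}% at {best_pct}%, worst {worst_acc:.2f}% at {worst_pct}%")
```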


4. CONCLUSION 

Classifying pancreatic cancer with logistic regression and random forest produced several results
for accuracy, precision, recall, and F1-score. Comparing the values given by the two methods, it can
be concluded that random forest generates better results than logistic regression. Random forest
gives the highest accuracy, 99.38% with 20% training data, while logistic regression reaches 96.48%
with 30% training data. Because of these good results, random forest rather than logistic regression
is suggested to help medical staff predict or classify a disease, especially for datasets similar to
the one used in this research.